Spatial Temporal Transformer Network for Skeleton-based Action Recognition
Skeleton-based human action recognition has attracted great interest in
recent years, as skeleton data have been shown to be robust to
illumination changes, body scales, dynamic camera views, and complex
backgrounds. Nevertheless, an effective encoding of the latent information
underlying the 3D skeleton is still an open problem. In this work, we propose a
novel Spatial-Temporal Transformer network (ST-TR) which models dependencies
between joints using the Transformer self-attention operator. In our ST-TR
model, a Spatial Self-Attention module (SSA) is used to understand intra-frame
interactions between different body parts, and a Temporal Self-Attention module
(TSA) to model inter-frame correlations. The two are combined in a two-stream
network which outperforms state-of-the-art models using the same input data on
both NTU-RGB+D 60 and NTU-RGB+D 120.
Comment: Accepted at ICPRW2020 (FBE2020, Workshop on Facial and Body Expressions, micro-expressions and behavior recognition), 8 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:2008.0740
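The Spatial Self-Attention (SSA) module described above relates the joints within a single frame through the Transformer self-attention operator. The following is a minimal NumPy sketch of single-head self-attention across joints; the shapes, projection matrices, and head count are illustrative and not the paper's exact configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(joints, Wq, Wk, Wv):
    """Single-head self-attention across the J joints of one frame.

    joints: (J, C) per-joint features; Wq/Wk/Wv: (C, D) learned projections.
    Each output joint is a weighted sum of all joints' values, with weights
    given by query-key affinities (the standard scaled dot-product form).
    """
    q, k, v = joints @ Wq, joints @ Wk, joints @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (J, J) joint-to-joint affinities
    return softmax(scores, axis=-1) @ v       # (J, D) attended features
```

Unlike the fixed adjacency of a skeleton graph, the (J, J) score matrix lets any joint attend to any other, which is the intra-frame flexibility the abstract emphasises.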
What can a cook in Italy teach a mechanic in India? Action Recognition Generalisation Over Scenarios and Locations
We propose and address a new generalisation problem: can a model trained for
action recognition successfully classify actions when they are performed within
a previously unseen scenario and in a previously unseen location? To answer
this question, we introduce the Action Recognition Generalisation Over
scenarios and locations dataset (ARGO1M), which contains 1.1M video clips from
the large-scale Ego4D dataset, across 10 scenarios and 13 locations. We
demonstrate recognition models struggle to generalise over 10 proposed test
splits, each of an unseen scenario in an unseen location. We thus propose CIR,
a method to represent each video as a Cross-Instance Reconstruction of videos
from other domains. Reconstructions are paired with text narrations to guide
the learning of a domain generalisable representation. We provide extensive
analysis and ablations on ARGO1M that show CIR outperforms prior domain
generalisation works on all test splits. Code and data:
https://chiaraplizz.github.io/what-can-a-cook/.
Comment: Accepted at ICCV 2023. Project page: https://chiaraplizz.github.io/what-can-a-cook
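The core idea of CIR is to represent each video as a reconstruction built from videos of other domains. A much-simplified NumPy sketch of that cross-instance step, using softmax attention over a support set of other-domain features (the temperature and the single-query formulation are assumptions for illustration, not the paper's full method):

```python
import numpy as np

def cross_instance_reconstruction(query, support, temperature=0.1):
    """Reconstruct one clip's feature as a convex combination of clip
    features drawn from OTHER scenarios/locations.

    query: (D,) feature of the clip being reconstructed.
    support: (N, D) features from different domains.
    Returns the reconstruction (D,) and the attention weights (N,).
    """
    sims = support @ query / temperature      # similarity logits, (N,)
    w = np.exp(sims - sims.max())
    w = w / w.sum()                           # softmax weights over support
    return w @ support, w                     # weighted sum of other-domain clips
```

In the paper these reconstructions are additionally paired with text narrations to guide learning; that supervision is omitted here.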
Domain generalization through audio-visual relative norm alignment in first person action recognition
First person action recognition is becoming an increasingly researched area thanks to the rising popularity of wearable cameras. This is bringing to light cross-domain issues that are yet to be addressed in this context. Indeed, the information extracted from learned representations suffers from an intrinsic "environmental bias". This strongly affects the ability to generalize to unseen scenarios, limiting the application of current methods to real settings where labeled data are not available during training. In this work, we introduce the first domain generalization approach for egocentric activity recognition, by proposing a new audio-visual loss, called Relative Norm Alignment loss. It rebalances the contributions of the two modalities during training, over different domains, by aligning their feature norm representations. Our approach leads to strong results in domain generalization on both EPIC-Kitchens-55 and EPIC-Kitchens-100, as demonstrated by extensive experiments, and can also be extended to domain adaptation settings with competitive results.
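The abstract describes Relative Norm Alignment as rebalancing the two modalities by aligning their feature norms. A minimal sketch of one plausible form of such a loss, penalising the imbalance between the mean audio and visual feature norms; this is an assumption based on the abstract's description, and the paper's exact formulation may differ:

```python
import numpy as np

def relative_norm_alignment_loss(feat_audio, feat_visual):
    """Penalise the imbalance between the mean L2 feature norms of the
    audio and visual streams (sketch of the idea in the abstract).

    feat_audio: (B, Da) audio features, feat_visual: (B, Dv) visual features.
    The loss is zero when the two mean norms match, pushing training to
    keep both modalities' contributions comparable across domains.
    """
    na = np.linalg.norm(feat_audio, axis=1).mean()
    nv = np.linalg.norm(feat_visual, axis=1).mean()
    return (na / nv - 1.0) ** 2
```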
PoliTO-IIT Submission to the EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition
In this report, we describe the technical details of our submission to the
EPIC-Kitchens-100 Unsupervised Domain Adaptation (UDA) Challenge in Action
Recognition. To tackle the domain-shift which exists under the UDA setting, we
first exploited a recent Domain Generalization (DG) technique, called Relative
Norm Alignment (RNA). It consists of designing a model able to generalize well
to any unseen domain, regardless of whether target data are accessible at
training time. Then, in a second phase, we extended the approach to work on
unlabelled target data, allowing the model to adapt to the target distribution
in an unsupervised fashion. For this purpose, we included in our framework
existing UDA algorithms, such as Temporal Attentive Adversarial Adaptation
Network (TA3N), jointly with new multi-stream consistency losses, namely
Temporal Hard Norm Alignment (T-HNA) and Min-Entropy Consistency (MEC). Our
submission (entry 'plnet') is visible on the leaderboard and it achieved the
1st position for 'verb', and the 3rd position for both 'noun' and 'action'.
Comment: 3rd place in the 2021 EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition
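Among the losses named above, Min-Entropy Consistency (MEC) builds on the standard idea of entropy minimisation over unlabelled target predictions. The sketch below shows only that generic building block, not the report's multi-stream consistency formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mean_prediction_entropy(logits):
    """Mean Shannon entropy of the softmax predictions over a batch of
    unlabelled target clips. Minimising this pushes the classifier
    towards confident predictions on the target domain -- the usual
    core of entropy-based adaptation losses (a sketch, not the exact
    MEC loss used in the submission).

    logits: (B, K) classifier outputs.
    """
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=1).mean()
```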
EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge: Mixed Sequences Prediction
This report presents the technical details of our approach for the
EPIC-Kitchens-100 Unsupervised Domain Adaptation (UDA) Challenge in Action
Recognition. Our approach is based on the idea that the order in which actions
are performed is similar between the source and target domains. Based on this,
we generate a modified sequence by randomly combining actions from the source
and target domains. As only unlabelled target data are available under the UDA
setting, we use a standard pseudo-labeling strategy for extracting action
labels for the target. We then ask the network to predict the resulting action
sequence. This allows the model to integrate information from both domains
during training and to achieve better transfer results on the target. Additionally, to
better incorporate sequence information, we use a language model to filter
unlikely sequences. Lastly, we employed a co-occurrence matrix to eliminate
unseen combinations of verbs and nouns. Our submission, labeled as 'sshayan',
can be found on the leaderboard, where it currently holds the 2nd position for
'verb' and the 4th position for both 'noun' and 'action'.
Comment: 2nd place in the 2023 EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition
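The mixing step described above, in which a modified sequence is generated by randomly combining source actions with pseudo-labelled target actions, can be sketched as follows. The function and label names are illustrative; the report's actual pipeline also involves pseudo-label extraction, a language-model filter, and a verb-noun co-occurrence matrix, none of which are shown here:

```python
import numpy as np

def mix_action_sequences(source_labels, target_pseudo_labels, rng):
    """Randomly interleave a source action sequence (ground-truth labels)
    with a target sequence (pseudo-labels), preserving each domain's
    internal action order -- a sketch of the report's assumption that
    the order of actions transfers across domains.

    Returns the mixed label sequence and, per position, which domain
    the action came from.
    """
    seq, domains = [], []
    i = j = 0
    while i < len(source_labels) or j < len(target_pseudo_labels):
        take_source = j >= len(target_pseudo_labels) or (
            i < len(source_labels) and rng.random() < 0.5)
        if take_source:
            seq.append(source_labels[i]); domains.append("source"); i += 1
        else:
            seq.append(target_pseudo_labels[j]); domains.append("target"); j += 1
    return seq, domains
```

The network is then asked to predict the labels of this mixed sequence, forcing it to use information from both domains at once.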
An Outlook into the Future of Egocentric Vision
What will the future be? We wonder! In this survey, we explore the gap
between current research in egocentric vision and the ever-anticipated future,
where wearable computing, with outward facing cameras and digital overlays, is
expected to be integrated in our every day lives. To understand this gap, the
article starts by envisaging the future through character-based stories,
showcasing through examples the limitations of current technology. We then
provide a mapping between this future and previously defined research tasks.
For each task, we survey its seminal works, current state-of-the-art
methodologies and available datasets, then reflect on shortcomings that limit
its applicability to future research. Note that this survey focuses on software
models for egocentric vision, independent of any specific hardware. The paper
concludes with recommendations for areas of immediate exploration so as to
unlock our path to the future always-on, personalised and life-enhancing
egocentric vision.
Comment: We invite comments, suggestions and corrections here: https://openreview.net/forum?id=V3974SUk1
Skeleton-based action recognition via spatial and temporal transformer networks
Skeleton-based Human Activity Recognition has attracted great interest in
recent years, as skeleton data have been shown to be robust to illumination
changes, body scales, dynamic camera views, and complex backgrounds. In
particular, Spatial-Temporal Graph Convolutional Networks (ST-GCN) have proven
effective in learning both spatial and temporal dependencies on
non-Euclidean data such as skeleton graphs. Nevertheless, an effective encoding
of the latent information underlying the 3D skeleton is still an open problem,
especially when it comes to extracting effective information from joint motion
patterns and their correlations. In this work, we propose a novel
Spatial-Temporal Transformer network (ST-TR) which models dependencies between
joints using the Transformer self-attention operator. In our ST-TR model, a
Spatial Self-Attention module (SSA) is used to understand intra-frame
interactions between different body parts, and a Temporal Self-Attention module
(TSA) to model inter-frame correlations. The two are combined in a two-stream
network, whose performance is evaluated on three large-scale datasets,
NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics Skeleton 400, consistently improving
backbone results. Compared with methods that use the same input data, the
proposed ST-TR achieves state-of-the-art performance on all datasets when using
joints' coordinates as input, and results on par with the state of the art when
adding bone information.
Comment: Accepted at Computer Vision and Image Understanding (CVIU), 12 pages, 8 figures.
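Complementing the spatial module, the Temporal Self-Attention (TSA) module described above models inter-frame correlations by attending across time. A minimal NumPy sketch of single-head attention over the T frames of a clip, computed independently for each joint; as before, shapes and projections are illustrative rather than the paper's exact configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(clip, Wq, Wk, Wv):
    """Self-attention across the T frames of a clip, per joint.

    clip: (T, J, C) features for T frames and J joints; Wq/Wk/Wv: (C, D).
    Joints are moved to the batch axis, so each joint's trajectory
    attends over all time steps independently of the other joints.
    Returns (T, J, D).
    """
    q, k, v = clip @ Wq, clip @ Wk, clip @ Wv            # (T, J, D)
    q, k, v = (np.transpose(a, (1, 0, 2)) for a in (q, k, v))  # (J, T, D)
    scores = q @ np.transpose(k, (0, 2, 1)) / np.sqrt(k.shape[-1])  # (J, T, T)
    out = softmax(scores, axis=-1) @ v                   # (J, T, D)
    return np.transpose(out, (1, 0, 2))                  # back to (T, J, D)
```

In the two-stream ST-TR design, the outputs of spatial and temporal streams like this one are combined before classification.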